Automated Medical Records Review for Mild Cognitive Impairment and Dementia

Abstract
Objectives: Unstructured and structured data in electronic health records (EHR) are a rich source of information for research and quality improvement studies. However, extracting accurate information from EHR is labor-intensive. Here we introduce an automated EHR phenotyping model to identify patients with Alzheimer's disease and related dementias (ADRD) or mild cognitive impairment (MCI).
Methods: We assembled medical notes and associated International Classification of Diseases (ICD) codes and medication prescriptions from 3,626 outpatient adults from two hospitals seen between February 2015 and June 2022. Ground truth annotations regarding the presence vs. absence of a diagnosis of MCI or ADRD were determined through manual chart review. Indicators included the presence of keywords and phrases in unstructured clinical notes, prescriptions of medications associated with MCI/ADRD, and ICD codes associated with MCI/ADRD. We trained a regularized logistic regression model to predict the ground truth annotations. Model performance was evaluated using area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), accuracy, specificity, precision/positive predictive value, recall/sensitivity, and F1 score (harmonic mean of precision and recall).
Results: Thirty percent of patients in the cohort carried diagnoses of MCI/ADRD based on manual review. When evaluated on a held-out test set, the best model, using clinical notes, ICDs, and medications, achieved an AUROC of 0.98, an AUPRC of 0.98, an accuracy of 0.93, a sensitivity (recall) of 0.91, a specificity of 0.96, a precision of 0.96, and an F1 score of 0.93. The estimated overall accuracy for patients randomly selected from EHRs was 99.88%.
Conclusion: Automated EHR phenotyping accurately identifies patients with MCI/ADRD based on clinical notes, ICD codes, and medication records. This approach holds potential for large-scale MCI/ADRD research utilizing EHR databases.


Introduction
Globally, 12% to 18% of people aged 60 or older are living with mild cognitive impairment (MCI) [1], and 10% to 15% of individuals living with MCI develop dementia each year [2]. About one-third of people living with MCI due to Alzheimer's disease (AD) develop dementia within five years [3]. The number of people in the United States with AD dementia will increase dramatically in the next 30 years due to growth of the population over the age of 65 [4]. Observational data from Electronic Health Records (EHRs) are an increasingly important resource for research on risk factors and potential interventions for MCI and AD [5-7]. A key challenge in scaling up EHR-based research is the accurate phenotyping of patients with a diagnosis of MCI and ADRD. Many studies rely on billing codes (International Classification of Diseases: ICD-9, ICD-10) [8,9]; however, these are often inaccurate [10]. Manual review of clinical notes is more accurate [11-15] but labor-intensive and impossible to conduct at large scale [16-18].
Automated EHR phenotyping seeks to address these challenges by automatically extracting information from clinical notes and combining this with structured information (e.g., medication prescriptions and diagnostic billing codes) to infer information of interest [19]. Here, we present a machine learning (ML)-based EHR phenotyping model to automate the process of chart review for MCI/ADRD. We demonstrate that our model, combining information from clinical notes, ICD codes, and medications, provides accurate MCI/ADRD phenotyping and is thus suitable for large-scale EHR research.

Materials and Methods

Study cohort
EHR data was extracted under protocols approved by the Massachusetts General Hospital (MGH) and Beth Israel Deaconess Medical Center (BIDMC) Institutional Review Boards with waivers of informed consent. A CONSORT diagram is provided in Figure 1.
Patients were selected from MGH and BIDMC EHR archives from visits that took place between January 3rd, 2012, and November 3rd, 2017. Because ADRD/MCI is age-associated, only patients aged 50 years or older were included. We randomly selected patients for inclusion using a stratified sampling strategy to ensure adequate representation of patients with low, medium, and high likelihood of having an MCI/ADRD diagnosis, which facilitates subsequent model development. Specifically, we created 4 groups based on the presence or absence of computable criteria (i.e., criteria that do not depend on analysis of unstructured text data): 'MED-ICD-', 'MED+ICD+', 'MED+ICD-', and 'MED-ICD+', denoting groups of patients with and without ICD codes and medications (MED) associated with MCI/ADRD. Notes for patients within each of these groups were subsequently manually reviewed to determine which patients had an MCI/ADRD diagnosis (described below).

Study data
Study data included unstructured (i.e., free text) clinical notes and structured data. Structured data included International Classification of Diseases (ICD) codes [20] for MCI/ADRD and dementia-related medications (see below). Clinical notes included office visit notes, admission notes, progress notes, discharge notes, and correspondence from all medical specialties in the MGH and BIDMC systems. Clinical notes contain a wide range of information such as chief complaint; history of present illness; physician examinations, observations, assessments, and treatment plans; active problems; and current and past medications. All patients included in the study had at least one clinical note. If a patient had any MED or ICD records, only notes recorded after the first appearance of a MED or ICD record were selected for analysis. We removed notes with fewer than 500 words, as these were generally administrative notes without significant medical content.
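
The sampling groups and note filters described above can be illustrated with a minimal Python sketch. The function names, the (date, text) note layout, and the choice to keep notes written on the same day as the first MED or ICD record are assumptions made for illustration; they are not taken from the study code.

```python
from datetime import date

MIN_WORDS = 500  # notes shorter than this are dropped, per the text


def sampling_group(has_med, has_icd):
    """Return one of 'MED+ICD+', 'MED+ICD-', 'MED-ICD+', 'MED-ICD-'."""
    return f"MED{'+' if has_med else '-'}ICD{'+' if has_icd else '-'}"


def select_notes(notes, first_record_date=None):
    """Apply the two note filters described in the text.

    notes: list of (note_date, text) tuples (hypothetical layout).
    first_record_date: date of the patient's first MED or ICD record,
        or None if the patient has no such record.
    """
    kept = []
    for note_date, text in notes:
        if len(text.split()) < MIN_WORDS:
            continue  # likely administrative, little medical content
        # Assumption: notes dated on the same day as the first record are kept.
        if first_record_date is not None and note_date < first_record_date:
            continue
        kept.append((note_date, text))
    return kept
```

For example, a patient with a dementia-related prescription but no MCI/ADRD ICD code falls in `sampling_group(True, False)`, i.e. 'MED+ICD-'.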

Ground truth labels: Manual chart review
We used a web-based tool, developed in-house, that highlights keywords from a provided list within notes.

Predictors included in the model
The entire method is depicted in Figure 2. Model inputs comprised 9 groups of medications, 6 groups of ICD codes, and text features.
Text features: Keywords, phrases, and word patterns were extracted from notes. We converted the text in each note to lowercase, removed stop words and special characters, and applied lemmatization. Subsequently, we extracted unique words from each note. We also extracted unique bigrams (two consecutive words) and trigrams (three consecutive words) to identify potentially discriminative features. For each note, we created a vector representation of the information within the note using Term Frequency-Inverse Document Frequency (TF-IDF) weighting [22,23]. TF-IDF quantifies the importance of a word within a note, relative to its prevalence across a collection of notes. Specifically, we created vectors containing TF-IDF values for each of the candidate features identified above. TF-IDF features were assembled into a single overall vector, which serves as input to the classification model.
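
As an illustration of the text-feature step, the following is a minimal, standard-library sketch of TF-IDF over unigrams, bigrams, and trigrams. It is not the study's implementation: the stop-word list is a toy placeholder, lemmatization is omitted, and the smoothed IDF formula is one common variant chosen here as an assumption.

```python
import math
import re
from collections import Counter

# Toy stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "with", "is", "for", "to", "in"}


def tokenize(text):
    """Lowercase, strip special characters, drop stop words (no lemmatization)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def ngrams(tokens, max_n=3):
    """Return unigrams, bigrams, and trigrams as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf_vectors(notes):
    """Return (vocabulary, list of TF-IDF vectors), one vector per note."""
    term_counts = [Counter(ngrams(tokenize(note))) for note in notes]
    vocab = sorted(set().union(*term_counts))
    n_docs = len(notes)
    # Document frequency: number of notes in which each term appears.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for counts in term_counts:
        total = sum(counts.values()) or 1
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors
```

Each note becomes one row whose columns are the candidate n-gram features; a term absent from a note receives a weight of zero.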

Classification model training and testing
We trained a logistic regression model to assign a probability to each note, representing the likelihood that the clinical note indicates MCI/ADRD. To address the imbalance between positive and negative samples, we used class weights inversely proportional to the size of each class. LASSO regularization was employed for automated feature selection, with the relative importance of each feature assessed based on the magnitude of the resulting regression coefficients. To evaluate the performance of our model and assess generalizability, we implemented 5-fold stratified nested cross-validation. Hyperparameter optimization was conducted using internal cross-validation. To compare the informativeness of different data types, we trained models with the following input combinations: (1) clinical notes, ICDs, and medications combined; (2) clinical notes only; (3) ICDs only; and (4) medications only. Feature importance analysis was performed to determine which variables had the greatest impact on the model's predictions.
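
Two ingredients of this training setup, inverse-frequency class weights and stratified fold assignment, can be sketched in plain Python. The weighting formula below (n_samples / (n_classes × class count)) is the widely used "balanced" heuristic and is an assumption about the exact scheme; the function names are hypothetical, and model fitting and the inner hyperparameter loop of the nested cross-validation are not shown.

```python
import random
from collections import Counter, defaultdict


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample index to a fold, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds  # deal indices round-robin
    return fold_of
```

In a nested design, each outer fold would serve once as the test set, and the same fold assignment would be repeated on the remaining training data to tune the regularization strength.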

Model performance metrics
Model performance was evaluated using accuracy, precision, recall, specificity, F1-score, AUROC, and AUPRC [24]. For each metric, we report micro-averaged values across positive and negative diagnoses of MCI/ADRD. We conducted 1000 iterations of bootstrapping to obtain the 95% confidence intervals (CI). Additionally, confusion matrices for the various training and testing datasets further illustrate the model's performance across different data splits. Error analysis was conducted to identify the primary sources of misclassification.
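
The threshold-based metrics and the bootstrap confidence intervals can be sketched as follows. This is an illustrative sketch, not the study code: it assumes binary labels (1 = MCI/ADRD), uses the percentile method for the interval (the text does not state which bootstrap variant was used), and omits AUROC and AUPRC, which require predicted probabilities.

```python
import random


def classification_metrics(y_true, y_pred):
    """Compute threshold-based metrics from binary labels (1 = MCI/ADRD)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


def bootstrap_ci(y_true, y_pred, metric="accuracy", n_iter=1000, seed=0):
    """Percentile 95% CI from resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]
        m = classification_metrics([y_true[i] for i in idx],
                                   [y_pred[i] for i in idx])[metric]
        if m == m:  # skip undefined (NaN) resamples
            values.append(m)
    values.sort()
    return values[int(0.025 * len(values))], values[int(0.975 * len(values)) - 1]
```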

Performance in unselected / random EHR samples
The sample utilized for training and testing is, by construction, enriched for "positive" cases, i.e., there are more MED+/ICD+, MED+/ICD-, and MED-/ICD+ cases, and fewer MED-/ICD- cases, than would be present in a random sample. Thus, the overall error rate and other overall performance statistics calculated for our cohort are "biased", i.e., they do not represent the performance that we would expect in a general, unselected hospital population. To obtain an unbiased estimate of model performance, we first estimated the error rates within each of the 4 groups ($P_e^{++}$, $P_e^{+-}$, $P_e^{-+}$, $P_e^{--}$), then combined these with estimates of the prevalences of the 4 groups in the general hospital population to obtain an estimate of the unbiased error rate. Specifically, we obtained estimates of the prevalences of each group ($p^{++}$, $p^{+-}$, $p^{-+}$, $p^{--}$) by sampling 500 patients randomly from BIDMC and 500 randomly from MGH, and calculated the proportions falling within each of the 4 groups. The prevalences and error rates are then combined to give an overall expected error rate $P[E]$ using the following formula:

$$P[E] = P_e^{++}\,p^{++} + P_e^{+-}\,p^{+-} + P_e^{-+}\,p^{-+} + P_e^{--}\,p^{--}$$
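
The prevalence-weighted sum above can be written as a short Python function. The numbers in the usage note are placeholders chosen purely for illustration; they are not the study's estimated error rates or prevalences.

```python
GROUPS = ("MED+ICD+", "MED+ICD-", "MED-ICD+", "MED-ICD-")


def expected_error_rate(group_error, group_prevalence):
    """Combine per-group error rates with population prevalences.

    Both arguments are dicts keyed by the four sampling groups.
    """
    total = sum(group_prevalence[g] for g in GROUPS)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("group prevalences must sum to 1")
    return sum(group_error[g] * group_prevalence[g] for g in GROUPS)
```

With hypothetical error rates of 0.10, 0.20, 0.30, and 0.01 and hypothetical prevalences of 0.02, 0.03, 0.05, and 0.90 (in the order of `GROUPS`), the expected error rate is 0.002 + 0.006 + 0.015 + 0.009 = 0.032. Because the MED-/ICD- group dominates an unselected population, its error rate largely determines the overall figure.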

Patient population
Figure 1 presents the CONSORT diagram illustrating the cohort selection process. The MGH cohort comprised 2,058 patients with 3,332 visits, while the BIDMC cohort included 1,819 patients with 3,479 visits. A total of 112 cases, accounting for 765 visits, were excluded due to uncertain manual annotations. After categorizing patients into four sampling groups, the final counts were as follows: 3,626 patients with 5,612 visits. Specifically, the 'MED+ICD+' group had 133 patients with 1,751 visits; the 'MED+ICD-' group included 466 patients with 481 visits; the 'MED-ICD+' group consisted of 214 patients with 214 visits; and the 'MED-ICD-' group contained 2,813 patients with 3,166 visits. In the subgroup analysis, the 'MED+ICD+' group had 121 MCI/ADRD-positive patients with 1,643 visits and 12 MCI/ADRD-negative patients with 108 visits. The 'MED+ICD-' group included 338 MCI/ADRD-positive patients with 353 visits and 121 MCI/ADRD-negative patients with 128 visits. The 'MED-ICD+' group comprised 125 MCI/ADRD-positive patients with 125 visits and 89 MCI/ADRD-negative patients with 89 visits. The largest group, 'MED-ICD-', included 522 MCI/ADRD-positive patients with 592 visits and 2,291 MCI/ADRD-negative patients with 2,574 visits. The cohort was selected to ensure sufficient patients with MCI/ADRD diagnoses for model training. Consequently, the final MGH and BIDMC cohort included 1,106 (30.5%) MCI/ADRD-positive patients from 1,634 visits and 2,587 MCI/ADRD-negative patients from 3,978 visits.
Figure 6 shows the coefficient values of the top 15 features selected during model training. Notably, the presence of the word "dementia" in a note emerged as the most informative feature, followed closely by the prescription of MCI/ADRD-related medications, including donepezil, aricept, rivastigmine, and memantine. Other top-15 MCI/ADRD-related keywords included "cognitive impairment", "Alzheimer", "MCI", "memory", "cognitive", and "decline".

Error analysis
We conducted a manual review of cases to gain qualitative insights into reasons for model errors. False positives arose primarily from clinical notes describing symptoms resembling MCI/ADRD, such as memory loss and cognitive decline, that were attributable to alternative causes such as depression and anxiety. Conversely, false negatives arose primarily from notes by specialists seeing patients with MCI/ADRD for care in other areas of medicine, such as nephrology or gynecology, who commented sparsely on issues related to the MCI/ADRD diagnosis in their notes.

Discussion
Our machine learning-based automated EHR phenotyping model accurately identifies patients diagnosed with MCI/ADRD using unstructured clinical notes, ICD codes, and MCI/ADRD medications. Models incorporating textual notes, either alone or combined with ICD codes and medication data, consistently outperformed those relying solely on ICD codes or medication data. The ICD+MED+Note model achieved the highest performance, underscoring the importance of integrating diverse data sources for enhanced accuracy and reliability. As shown in Table 2, models using only ICD codes exhibit lower specificity (i.e., a higher false positive rate). In contrast, models based solely on clinical notes demonstrate superior performance, with notable differences in AUROC (0.94 vs. 0.54), AUPRC (0.95 vs. 0.69), and F1-score (0.78 vs. 0.60).
An important finding of our analysis is that the information in clinical notes leads to much better performance than ICD codes alone [25,26]. For example, ICD codes may be triggered by a broad range of cognitive symptoms that are not necessarily due to MCI or dementia. This is more likely in the older population, where other conditions are common, including medication side effects or interactions, depression and mood difficulties [27], hypothyroidism, substance use (such as alcohol and marijuana), and sleep difficulties [28]. The accurate diagnosis of MCI or dementia requires comprehensive testing [29-31], including cognitive testing, physical examination, and often neuroimaging [32]. Clinical notes often contain more detailed descriptions and therefore more information about the ground truth.
The performance indicates good generalizability across sites. Nevertheless, there are notable site differences. Performance on MGH test data was better than performance on BIDMC test data, regardless of the source (BIDMC or MGH) of the training data, suggesting that the MGH dataset may present fewer complexities or challenges. Conversely, adding data from BIDMC to the training dataset resulted in a slight decline in performance on tests conducted at BIDMC. Specifically, the AUROC decreased from 0.98 to 0.92, and the AUPRC decreased from 0.98 to 0.91. These changes represent a minimal impact on the model's performance. The features retained by the model align with existing medical knowledge and with an established understanding of dementia and MCI/ADRD medications, suggesting that the model has learned a reasonable, interpretable pattern. In the future, we hope to apply this model to identify MCI/ADRD patients from electronic health records at scale, thus creating opportunities for large-scale EHR-based studies.
A strength of our approach is the use of data across two health networks (MGH and BIDMC); most published studies focus on single-site data [25,33-35]. This multi-site comparison allowed for a broader validation of our model, showing consistency in performance metrics such as accuracy, specificity, and AUC across different institutional datasets. Our model achieved high AUROC (0.98), demonstrating robust discrimination across testing scenarios. We also emphasize the importance of the high observed AUPRC, particularly as it relates to clinical relevance in contexts where class imbalance is pronounced. The high AUPRC indicates that our model effectively identifies "rare" events, such as ADRD/MCI diagnoses, from large healthcare datasets. This strong performance in correctly identifying true positive cases is crucial for accurately identifying less common conditions with significant clinical implications.
Additionally, our study included a larger patient cohort than those typically reported [25,33-35], which enhances the generalizability of our findings. References to other studies comparing site performance are scarce, making our contributions significant for future multicentric studies. The testing performance was generally better on MGH data than on BIDMC data, regardless of the site of origin of the training data. This suggests that there may be a larger proportion of difficult or ambiguous cases in the BIDMC test set. Nevertheless, test performance was excellent for both sites, suggesting generalizability across notes written within different medical institutions.
Our study has important limitations. While our experiments included two medical centers, these are located in the same geographic region (Boston, United States), and may thus not be representative of other US and non-US populations. Future studies that apply our model across different hospitals and EHR systems should therefore check for performance biases that might arise due to different demographics, larger sample sizes, bias in data collection, and EHR data storage formats. An additional limitation is that our model does not identify specific subtypes of ADRD and provides no information about the severity of ADRD or MCI. Incorporating large language models (LLMs) like GPT, an approach we did not use, may enhance feature extraction and the interpretation of clinical notes by handling complex medical language and context-specific nuances more effectively. Overall, this study represents an important step towards unlocking the vast potential of EHR data to advance our understanding of mild cognitive impairment and dementia, and it enables various downstream studies.
In conclusion, our model combining clinical notes, ICD codes, and medications from the EHR system provides accurate MCI/ADRD phenotyping. In the future, this work will enable important downstream large-scale analyses to understand various aspects of MCI/ADRD.

Declarations
Generalizability experiments
To evaluate the model's generalizability across institutions and to enhance robustness by incorporating data from both, we conducted five experiments:
1) MGH as Training Set, BIDMC as Testing Set: We trained the model exclusively with data from MGH and tested the model on data from BIDMC.
2) BIDMC as Training Set, MGH as Testing Set: We trained the model exclusively with data from BIDMC and tested the model on data from MGH.
3) MGH+BIDMC Training Set, MGH+BIDMC Testing Set: Training data came from both MGH and BIDMC, as did testing data.
4) MGH+BIDMC Training Set, MGH Testing Set: Training data came from both MGH and BIDMC, testing data from MGH only.
5) MGH+BIDMC Training Set, BIDMC Testing Set: Training data came from both MGH and BIDMC, testing data from BIDMC only.
Author B.W. and R.T. acquired funding for this study. R.W., B.W., M.F., and H.S. designed the study. S.B., R.T., V.J., R.M., and A.G. contributed to the acquisition of patient data from the Beth Israel Deaconess Medical Center. S.Z., V.J., and A.G. contributed to the acquisition of patient data from the Massachusetts General Hospital. B.W., A.S., and R.W. contributed to annotating the training and test set. R.W. constructed models, analyzed data, interpreted and developed the evaluation methods for each stage of the project, drafted the article text, and reviewed and edited the article prior to submission. All authors critically reviewed the manuscript and approved the final version of the manuscript.

Figures

Figure 3

Table 1: Cohort characteristics. (a) Age at baseline for the first visit in the study period. (b) 'Other' includes 'unknown', 'declined to answer', 'American Indian or Alaska Native', and 'Native Hawaiian or other Pacific Islander'.

Model performance
Performance results for predicting MCI/ADRD chart diagnoses are presented in Table 2, which varies the model inputs, and Table 3, which varies the training and testing cohorts. Table 2 presents the average performance for logistic regression models using ICD Only, Med Only, Note Only, and ICD+MED+Note inputs in the MGH+BIDMC training sets and MGH+BIDMC testing sets. The findings indicate a clear pattern in the performance of logistic regression models based on different input data types. Models that incorporate textual note data, either alone or in combination with ICD codes and medication data, consistently outperform models using only ICD codes or only medication data across all performance metrics, with an accuracy of 0.89, specificity of 0.90, AUROC of 0.95, and AUPRC of 0.95. In Table 3, the highest performance was observed when the MGH+BIDMC training set was tested on the MGH set, achieving an AUROC of 0.98, an AUPRC of 0.98, an accuracy of 0.93, a specificity of 0.96, a precision of 0.96, an F1 score of 0.93, and a recall of 0.91.
Figure 3 provides a comparative analysis of various training and testing approaches using ROC curves (left panel) and Precision-Recall (PR) curves (right panel). The highest AUROC observed is 0.99 for the MGH training and MGH testing set. The Precision-Recall curves demonstrate the trade-off between precision and recall, with the highest Precision-Recall AUC being 0.98 for multiple model configurations, including MGH training tested on the MGH set and MGH+BIDMC training tested on the MGH set.

AUPRC: area under the precision-recall curve, which summarizes precision and recall across different thresholds. Inputs: ICD Only: models using only International Classification of Diseases codes. Med Only: models using only medication data.

Table 3: Average performance and [95% confidence intervals] for the logistic regression model using all features in the different testing sets. The bootstrapped 95% confidence intervals are given in parentheses. ACC: accuracy; Spec: specificity; AP: average precision; AUROC: area under the receiver operating characteristic curve; AUPRC: area under the precision-recall curve. Data sets: MGH: data derived from Massachusetts General Hospital. BI: data derived from Beth Israel Deaconess Medical Center. MGH+BIDMC: data derived from Massachusetts General Hospital and Beth Israel Deaconess Medical Center.

Figure 4 shows the confusion matrices for the various training/testing experiments in predicting MCI/ADRD. The columns correspond to predicted MCI/ADRD status, while rows represent the ground truth classification based on chart review. The model trained and tested on MGH data shows the highest accuracy, with 98.17% for negative and 91.48% for positive predictions. Conversely, models trained on one dataset and tested on another exhibit lower performance, particularly in positive predictions. Combining both datasets for training (MGH+BI) and testing on the combined or individual datasets yields intermediate performance.